Perform text analysis.
Perform sentiment analysis or topic modeling using text analysis methods as demonstrated in the pre-class work and in the readings.
Do the above. Can’t think of a data source?
- `gutenbergr`
- `AssociatedPress` from the `topicmodels` package
- `NYTimes` or `USCongress` from the `RTextTools` package
- `devtools::install_github("bradleyboehmke/harrypotter")`
- [State of the Union speeches](https://pradeepadhokshaja.wordpress.com/2017/03/31/scraping-the-web-for-presdential-inaugural-addresses-using-rvest/)
- Scrape tweets using [`twitteR`](https://www.credera.com/blog/business-intelligence/twitter-analytics-using-r-part-1-extract-tweets/)

Analyze the text for sentiment OR topic. You do not need to do both. The DataCamp courses and Tidy Text Mining with R are good starting points for templates to perform this type of analysis, but feel free to expand beyond these examples.
We will spend the next 2 weeks working on analyzing textual data in R. You will do the following:
# sevenbook is a data frame in tidy text format (one word per row) covering all 7 novels.
sevenbook
## # A tibble: 409,338 x 4
## chapter word title series
## <int> <chr> <chr> <int>
## 1 1 boy philosophers_stone 1
## 2 1 lived philosophers_stone 1
## 3 1 dursley philosophers_stone 1
## 4 1 privet philosophers_stone 1
## 5 1 drive philosophers_stone 1
## 6 1 proud philosophers_stone 1
## 7 1 perfectly philosophers_stone 1
## 8 1 normal philosophers_stone 1
## 9 1 people philosophers_stone 1
## 10 1 expect philosophers_stone 1
## # ... with 409,328 more rows
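The code that built `sevenbook` is not shown in the report. As a minimal sketch, assuming the raw text is held one chapter per row, a tidy one-word-per-row table with stop words removed could be produced as follows. The `raw_books` data frame and its contents are hypothetical stand-ins, not the real novels.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the raw text: one row per chapter.
raw_books <- tibble(
  title   = c("philosophers_stone", "philosophers_stone"),
  series  = c(1L, 1L),
  chapter = c(1L, 2L),
  text    = c("The boy who lived on Privet Drive",
              "The glass vanished in front of the boy")
)

tidy_books <- raw_books %>%
  unnest_tokens(word, text) %>%        # split into one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common stop words such as "the"

tidy_books
```

The other columns (`title`, `series`, `chapter`) are carried along unchanged, which is what makes the later per-book and per-chapter counts possible.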
**1.1 What are the top words in each book?**
# Top 10 words in each novel
top_words
## # A tibble: 70 x 3
## # Groups: title [7]
## title word n
## <chr> <chr> <int>
## 1 chamber_of_secrets harry 1503
## 2 chamber_of_secrets ron 650
## 3 chamber_of_secrets hermione 289
## 4 chamber_of_secrets malfoy 202
## 5 chamber_of_secrets lockhart 197
## 6 chamber_of_secrets professor 190
## 7 chamber_of_secrets weasley 157
## 8 chamber_of_secrets looked 155
## 9 chamber_of_secrets time 148
## 10 chamber_of_secrets eyes 145
## # ... with 60 more rows
# Plot the bar chart of top words
graph_top
# The bar charts show that the main characters are Harry, Ron and Hermione.
# The most common words are usually related to characters, such as "Dumbledore", "Hagrid", "Snape", "Uncle" and "Professor".
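The report prints `top_words` without the code behind it. One way such a table might be computed is sketched below on a hypothetical toy data frame (`toy_words`), keeping the top 2 words per book instead of the top 10 so the result stays readable.

```r
library(dplyr)

# Hypothetical tidy data: one word per row, as in `sevenbook`.
toy_words <- tibble(
  title = c(rep("book_a", 6), rep("book_b", 5)),
  word  = c("harry", "harry", "harry", "ron", "ron", "eyes",
            "harry", "harry", "hermione", "time", "time")
)

top_words_sketch <- toy_words %>%
  count(title, word, sort = TRUE) %>%            # word frequency per book
  group_by(title) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%     # 2 most frequent words per book (10 in the report)
  ungroup()

top_words_sketch
```

The grouped result can then be passed to a faceted bar chart, one panel per `title`.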
**1.2 What are the common words in the novels after removing characters' names?**
# After removing common character-related words ("harry","harry's","potter","ron","hermione","dumbledore","snape","hagrid","weasley","voldemort","Malfoy","professor"), plot the top 10 words in each novel
no_char_graph
# The bar charts show that "looked", "eyes", "time", "voice" and "head" are among the top words; many of them relate to the body.
# Calculate each word's proportion within each novel
words_prop
## # A tibble: 63,651 x 5
## title series word n proportion
## <chr> <int> <chr> <int> <dbl>
## 1 order_of_the_phoenix 5 harry 3730 0.03854222
## 2 goblet_of_fire 4 harry 2936 0.04040571
## 3 deathly_hallows 7 harry 2770 0.03773533
## 4 half_blood_prince 6 harry 2581 0.04090462
## 5 prisoner_of_azkaban 3 harry 1824 0.04428474
## 6 chamber_of_secrets 2 harry 1503 0.04470420
## 7 order_of_the_phoenix 5 hermione 1220 0.01260630
## 8 philosophers_stone 1 harry 1213 0.04243484
## 9 order_of_the_phoenix 5 ron 1189 0.01228598
## 10 deathly_hallows 7 hermione 1077 0.01467183
## # ... with 63,641 more rows
# Calculate each word's proportion by chapter within each novel
words_prop_chapter
## # A tibble: 215,433 x 6
## title chapter series word n proportion
## <chr> <int> <int> <chr> <int> <dbl>
## 1 chamber_of_secrets 19 2 harry 173 0.05158020
## 2 goblet_of_fire 31 4 harry 161 0.05439189
## 3 goblet_of_fire 26 4 harry 159 0.05033238
## 4 prisoner_of_azkaban 21 3 harry 153 0.05694083
## 5 goblet_of_fire 28 4 harry 152 0.05175349
## 6 order_of_the_phoenix 35 5 harry 149 0.04623022
## 7 order_of_the_phoenix 24 5 harry 147 0.04766537
## 8 half_blood_prince 18 6 harry 145 0.05228994
## 9 goblet_of_fire 20 4 harry 144 0.05801773
## 10 goblet_of_fire 23 4 harry 144 0.04551201
## # ... with 215,423 more rows
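The `proportion` column above appears to be each word's count divided by the total number of words in the same book (or chapter). A minimal sketch of that calculation, on a hypothetical toy data frame, might look like this:

```r
library(dplyr)

# Hypothetical tidy data: one word per row.
toy_words <- tibble(
  title = c(rep("book_a", 4), rep("book_b", 5)),
  word  = c("harry", "harry", "ron", "eyes",
            "harry", "hermione", "hermione", "time", "time")
)

words_prop_sketch <- toy_words %>%
  count(title, word) %>%                 # occurrences of each word per book
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%    # share of all words in that book
  ungroup() %>%
  arrange(desc(n))

words_prop_sketch
```

For the per-chapter version, `chapter` would be added to both `count()` and `group_by()` so that the denominator becomes the chapter's word total.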
**2.1 How do the proportions of the three main characters change across the novels?**
# Plot the proportions of the three main characters in each book.
prop_book_graph
## The proportions of harry and ron decrease slightly over the series, while the proportion of hermione increases slightly. In the second book (chamber_of_secrets) there is a relatively large gap between the proportions of Ron and Hermione.
**2.2 How do the proportions of the three main characters change across the chapters of each book?**
# Draw line plots for each novel to compare how the proportions change
prop_chapter_graph
## Fans of Ron or Hermione can see in which chapters the character has a relatively high proportion. For example, in the first book (philosophers_stone), Ron and Hermione first appear in chapter 6.
**2.3 How do the proportions of other characters change across the novels?**
other_prop
# The line plot shows that Hagrid's proportion declines over the series. Dumbledore's proportion rises overall from book 1 to book 6 and then drops in book 7.
**3.1 What are the common joy words and sad words in the seven novels?**
# Extract the joy words from the NRC sentiment lexicon.
nrcjoy
## # A tibble: 689 x 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
## 7 accomplished joy
## 8 achieve joy
## 9 achievement joy
## 10 acrobat joy
## # ... with 679 more rows
# Use inner_join to perform the sentiment analysis.
joy
## # A tibble: 1,713 x 3
## title joyword n
## <chr> <chr> <int>
## 1 order_of_the_phoenix ministry 191
## 2 order_of_the_phoenix found 164
## 3 order_of_the_phoenix feeling 145
## 4 goblet_of_fire magical 129
## 5 goblet_of_fire ministry 115
## 6 half_blood_prince ministry 113
## 7 goblet_of_fire found 108
## 8 deathly_hallows ministry 96
## 9 half_blood_prince found 91
## 10 deathly_hallows found 87
## # ... with 1,703 more rows
# In every novel, "found" is a common joy word. "magical", "hope" and "smile" are also frequently used joy words across the seven books.
joy_graph
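The `inner_join` step mentioned above keeps only the words that occur in the lexicon and then counts them per book. The sketch below illustrates the idea with a hypothetical three-word lexicon (`toy_joy`) in place of the NRC joy list, which in the report would presumably come from `get_sentiments("nrc")` filtered to `sentiment == "joy"`.

```r
library(dplyr)

# Hypothetical miniature lexicon standing in for the NRC "joy" entries.
toy_joy <- tibble(word = c("found", "magical", "hope"), sentiment = "joy")

# Hypothetical tidy data: one word per row.
toy_words <- tibble(
  title = c("book_a", "book_a", "book_a", "book_b", "book_b"),
  word  = c("found", "found", "harry", "magical", "door")
)

joy_sketch <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%        # keep only words present in the lexicon
  count(title, joyword = word, sort = TRUE)   # frequency of each joy word per book

joy_sketch
```

Words absent from the lexicon ("harry" and "door" here) simply drop out of the result.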
# Extract the sad words from the NRC sentiment lexicon.
nrcsad
## # A tibble: 1,191 x 2
## word sentiment
## <chr> <chr>
## 1 abandon sadness
## 2 abandoned sadness
## 3 abandonment sadness
## 4 abduction sadness
## 5 abortion sadness
## 6 abortive sadness
## 7 abscess sadness
## 8 absence sadness
## 9 absent sadness
## 10 absentee sadness
## # ... with 1,181 more rows
# Use inner_join to perform the sentiment analysis.
sad
## # A tibble: 2,559 x 3
## title sadword n
## <chr> <chr> <int>
## 1 order_of_the_phoenix harry 3730
## 2 goblet_of_fire harry 2936
## 3 deathly_hallows harry 2770
## 4 half_blood_prince harry 2581
## 5 prisoner_of_azkaban harry 1824
## 6 chamber_of_secrets harry 1503
## 7 philosophers_stone harry 1213
## 8 prisoner_of_azkaban black 332
## 9 goblet_of_fire moody 309
## 10 deathly_hallows death 305
## # ... with 2,549 more rows
# In each novel, common sad words include "black" and "dark"; "kill", "bad", "leave" and "death" are also frequently used sad words across the seven books. The NRC results contain some oddities: the table above counts the character name "harry" as a sad word, and "mother" appears in both the joy and the sad word lists.
sad_graph
# Check the word "mother" in the NRC lexicon. It is tagged with several sentiments, including both joy and sadness, so we should exclude "mother" when analyzing sentiment here.
get_sentiments("nrc") %>% filter(word == "mother")
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 mother anticipation
## 2 mother joy
## 3 mother negative
## 4 mother positive
## 5 mother sadness
## 6 mother trust
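To act on this observation, words tagged with both joy and sadness can be detected and removed from the lexicon before joining. The following is a minimal sketch on a hypothetical four-row excerpt (`toy_nrc`); the real NRC lexicon is much larger and may contain other ambiguous words besides "mother".

```r
library(dplyr)

# Hypothetical excerpt of a lexicon in which one word carries several labels.
toy_nrc <- tibble(
  word      = c("mother", "mother", "smile", "death"),
  sentiment = c("joy", "sadness", "joy", "sadness")
)

# Words tagged with both joy and sadness.
ambiguous <- toy_nrc %>%
  filter(sentiment %in% c("joy", "sadness")) %>%
  group_by(word) %>%
  filter(n_distinct(sentiment) > 1) %>%
  ungroup() %>%
  distinct(word)

# Lexicon with the ambiguous words removed.
clean_nrc <- toy_nrc %>%
  anti_join(ambiguous, by = "word")

clean_nrc
```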
**3.2 How does the sentiment change across the novels / chapters? Does it become more positive or negative?**
# 3.2.1 Compare the ratio of negative to positive words used in the seven books. A larger ratio indicates a more negative sentiment.
ratio_np
# The line graph shows that the ratio of negative to positive words fluctuates: a high ratio is usually followed by a relatively low ratio in the next book, except that the ratio of "prisoner_of_azkaban" is higher than that of "chamber_of_secrets".
# 3.2.2 How does the ratio change through the chapters of each book?
ratio_chapter_np
# The line graphs for each book show that at the end of the story the ratio of negative to positive words declines to a lower level, which suggests a relatively "happy ending". The fluctuations within each book also show the ups and downs of the sentiment. For example, in half_blood_prince there is a peak of negative sentiment in chapter 29.
# Create a tidy text data frame that records the line number of each word.
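The ratio itself is not defined in code in the report. One plausible reading, sketched below, counts negative and positive words per book and divides one by the other. The four-word lexicon (`toy_bing`) and the word data are hypothetical; which lexicon the report used for this ratio is not stated.

```r
library(dplyr)
library(tidyr)

# Hypothetical positive/negative lexicon (stand-in for a lexicon such as Bing).
toy_bing <- tibble(
  word      = c("dark", "death", "happy", "love"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Hypothetical tidy data: one word per row.
toy_words <- tibble(
  title = c(rep("book_a", 4), rep("book_b", 3)),
  word  = c("dark", "death", "death", "happy", "dark", "love", "happy")
)

ratio_sketch <- toy_words %>%
  inner_join(toy_bing, by = "word") %>%
  count(title, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(ratio = negative / positive)    # > 1 means more negative than positive words

ratio_sketch
```

Adding `chapter` to `count()` would give the per-chapter ratio plotted in 3.2.2.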
series
## # A tibble: 409,485 x 5
## chapter linenumber word title series
## <int> <int> <chr> <chr> <int>
## 1 1 1 boy philosophers_stone 1
## 2 1 1 lived philosophers_stone 1
## 3 1 2 dursley philosophers_stone 1
## 4 1 2 privet philosophers_stone 1
## 5 1 2 drive philosophers_stone 1
## 6 1 2 proud philosophers_stone 1
## 7 1 2 perfectly philosophers_stone 1
## 8 1 2 normal philosophers_stone 1
## 9 1 3 people philosophers_stone 1
## 10 1 3 expect philosophers_stone 1
## # ... with 409,475 more rows
# Use the Bing lexicon to analyze how sentiment changes from section to section. Here sentiment = positive - negative (word counts).
series_bing
## Usually, there are more negative words in each section.
# Use the AFINN lexicon to analyze how sentiment changes from section to section. Here sentiment = sum(score).
series_afinn
## The AFINN results seem more reasonable, because AFINN assigns each word a graded score rather than a binary positive/negative label.
# Take philosophers_stone as an example to examine how sentiment changes within each chapter, using the Bing lexicon
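The section-level scores described above can be sketched as follows: line numbers are binned into fixed-size sections by integer division, and the net sentiment is computed per section. The section size of 80 lines is an assumption (the report does not state it), and both `toy_bing` and `toy_series` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical positive/negative lexicon standing in for Bing.
toy_bing <- tibble(
  word      = c("dark", "death", "happy", "love"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Hypothetical tidy data with a line number for each word.
toy_series <- tibble(
  title      = "book_a",
  linenumber = c(1L, 5L, 40L, 85L, 90L, 170L),
  word       = c("dark", "happy", "death", "love", "happy", "dark")
)

section_sketch <- toy_series %>%
  inner_join(toy_bing, by = "word") %>%
  count(title, index = linenumber %/% 80, sentiment) %>%   # assumed: 80 lines per section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                  # net sentiment per section

section_sketch
```

With a scored lexicon such as AFINN, the `count()` and `pivot_wider()` steps would instead be replaced by grouping on `title` and `index` and summing the per-word scores.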
sentence_sent
## # A tibble: 141 x 5
## chapter index negative positive sentiment
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 16 13 -3
## 2 1 1 26 14 -12
## 3 1 2 11 9 -2
## 4 1 3 12 4 -8
## 5 1 4 15 13 -2
## 6 1 5 22 17 -5
## 7 1 6 14 20 6
## 8 1 7 14 20 6
## 9 2 0 16 12 -4
## 10 2 1 19 17 -2
## # ... with 131 more rows
stone_graph
# We can see which chapters contain more sections with positive sentiment, such as chapters 5, 6 and 7.
sevenbook %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
# Across the seven books, the word cloud confirms that the main characters are "Harry", "Ron", "Hermione", "Dumbledore" and "Hagrid". Other common words are "looked", "time", "magic" and "eyes".
**Find the most common positive and negative words**
sevenbook %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 50)
## Joining, by = "word"
# From the word cloud, we find that the most common positive words throughout the series are "magic", "top", "happy", "gold", "love", "nice"... And the most common negative words are "dark", "fell", "hard", "death"...
# Examine the most common bigrams
bigram_n
## # A tibble: 523,420 x 3
## # Groups: title [7]
## title bigram n
## <chr> <chr> <int>
## 1 order_of_the_phoenix of the 1192
## 2 deathly_hallows of the 1002
## 3 goblet_of_fire of the 901
## 4 order_of_the_phoenix in the 872
## 5 half_blood_prince of the 707
## 6 order_of_the_phoenix said harry 689
## 7 deathly_hallows in the 673
## 8 goblet_of_fire in the 673
## 9 order_of_the_phoenix at the 607
## 10 order_of_the_phoenix on the 603
## # ... with 523,410 more rows
# The most common bigrams, such as "of the" and "in the", are not interesting because they consist mostly of stop words.
# Remove bigrams in which either word is a stop word
# New bigram counts
bigram_counts
## # A tibble: 89,120 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 professor mcgonagall 578
## 2 uncle vernon 386
## 3 harry potter 349
## 4 death eaters 346
## 5 harry looked 316
## 6 harry ron 302
## 7 aunt petunia 206
## 8 invisibility cloak 192
## 9 professor trelawney 177
## 10 dark arts 176
## # ... with 89,110 more rows
# Names are the most common pairs in the Harry Potter series.
# The table shows how often each ordered pair of the characters "Harry", "Ron" and "Hermione" occurs as a bigram
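The bigram steps described above (tokenize into pairs, drop pairs containing a stop word, count) could be sketched as follows. The two sentences in `toy_text` are hypothetical; the report's own code is not shown.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for the raw text.
toy_text <- tibble(
  title = "book_a",
  text  = c("Professor McGonagall looked at Uncle Vernon",
            "Professor McGonagall spoke to the boy")
)

bigram_sketch <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%      # consecutive word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%   # split each pair into two columns
  filter(!word1 %in% stop_words$word,                           # drop pairs where either
         !word2 %in% stop_words$word) %>%                       # word is a stop word
  count(word1, word2, sort = TRUE)

bigram_sketch
```

`tidyr::unite()` reverses the `separate()` step when a single `bigram` column is needed again, as in the united table later in the report.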
character_relationship
## # A tibble: 6 x 4
## word1 word2 n rank
## <chr> <chr> <int> <int>
## 1 harry ron 302 6
## 2 ron hermione 84 33
## 3 harry hermione 59 63
## 4 ron harry 54 71
## 5 hermione harry 35 143
## 6 hermione ron 23 249
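# Harry and Ron appear together most often (302 times as "harry ron").
# Ron and Hermione form the next most frequent pair (84 times as "ron hermione").
# Unite the two words back into one bigram and analyze by book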
bigrams_united
## # A tibble: 107,016 x 4
## title series bigram n
## <chr> <int> <chr> <int>
## 1 philosophers_stone 1 uncle vernon 97
## 2 philosophers_stone 1 professor mcgonagall 90
## 3 philosophers_stone 1 aunt petunia 52
## 4 philosophers_stone 1 harry potter 26
## 5 philosophers_stone 1 harry looked 22
## 6 philosophers_stone 1 professor dumbledore 20
## 7 philosophers_stone 1 professor quirrell 18
## 8 philosophers_stone 1 hermione granger 16
## 9 philosophers_stone 1 privet drive 16
## 10 philosophers_stone 1 professor flitwick 15
## # ... with 107,006 more rows
# Professor McGonagall is a common character in Harry Potter. The plot shows that the frequency of her name goes up in "order_of_the_phoenix".
united_graph
# In goblet_of_fire, "harry ron" is the most frequent bigram that starts with "harry".
bigram_harry
## # A tibble: 8,566 x 4
## title series bigram n
## <chr> <int> <chr> <int>
## 1 goblet_of_fire 4 harry ron 86
## 2 order_of_the_phoenix 5 harry looked 76
## 3 deathly_hallows 7 harry looked 60
## 4 goblet_of_fire 4 harry looked 58
## 5 order_of_the_phoenix 5 harry ron 54
## 6 prisoner_of_azkaban 3 harry ron 49
## 7 deathly_hallows 7 harry ron 40
## 8 half_blood_prince 6 harry looked 36
## 9 half_blood_prince 6 harry ron 34
## 10 chamber_of_secrets 2 harry looked 33
## # ... with 8,556 more rows
# Analyze the sentiment of words associated with "harry" using the AFINN lexicon
harry_sentiment
## # A tibble: 447 x 3
## word score n
## <chr> <int> <int>
## 1 yeah 1 47
## 2 reached 1 29
## 3 dear 2 26
## 4 lied -2 24
## 5 laughed 1 22
## 6 feeling 1 21
## 7 bitterly -2 19
## 8 fire -2 19
## 9 stopped -1 18
## 10 nervously -2 17
## # ... with 437 more rows
# The figure shows the common positive and negative sentiment words associated with "Harry".
harry_graph
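A table like `harry_sentiment` could be built by keeping bigrams whose first word is "harry" and joining the second word against a scored lexicon. This is a sketch under that assumption; the report does not say whether it used the word before or after "harry". The bigrams and the three-word lexicon below are hypothetical, and the `contribution` column is an added illustration. The column is named `score` to match the output above; more recent releases of the AFINN data name it `value`.

```r
library(dplyr)

# Hypothetical separated bigrams (one row per occurrence).
toy_bigrams <- tibble(
  word1 = c("harry", "harry", "harry", "ron", "harry"),
  word2 = c("laughed", "lied", "looked", "laughed", "laughed")
)

# Hypothetical AFINN-style lexicon: one integer score per word.
toy_afinn <- tibble(
  word  = c("laughed", "lied", "yeah"),
  score = c(1L, -2L, 1L)
)

harry_sketch <- toy_bigrams %>%
  filter(word1 == "harry") %>%                        # bigrams that start with "harry"
  inner_join(toy_afinn, by = c("word2" = "word")) %>% # score the following word
  count(word = word2, score, sort = TRUE) %>%
  mutate(contribution = n * score)                    # illustrative: frequency-weighted score

harry_sketch
```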
**Network of bigrams**
# Keep only relatively common combinations (bigrams that occur more than 60 times)
bigram_graph
## IGRAPH cc2950d DN-- 85 60 --
## + attr: name (v/c), n (e/n)
## + edges from cc2950d (vertex names):
## [1] professor ->mcgonagall uncle ->vernon
## [3] harry ->potter death ->eaters
## [5] harry ->looked harry ->ron
## [7] aunt ->petunia invisibility->cloak
## [9] professor ->trelawney dark ->arts
## [11] professor ->umbridge death ->eater
## [13] entrance ->hall madam ->pomfrey
## [15] dark ->lord professor ->dumbledore
## + ... omitted several edges
# The figure visualizes the relationships between words in the Harry Potter series. It corresponds to the table (bigram_counts) above: "Professor McGonagall" and "Uncle Vernon" are among the most common combinations.
network
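The printed `IGRAPH` object above is a directed graph with an edge attribute `n`. A minimal sketch of how such a graph might be built from a bigram count table is given below; `toy_counts` is a hypothetical four-row table, and the threshold of 60 follows the comment in the report.

```r
library(dplyr)

# Hypothetical bigram counts with columns word1, word2, n.
toy_counts <- tibble(
  word1 = c("professor", "uncle", "harry", "professor"),
  word2 = c("mcgonagall", "vernon", "potter", "flitwick"),
  n     = c(578L, 386L, 349L, 15L)
)

graph_sketch <- toy_counts %>%
  filter(n > 60) %>%                  # keep only common bigrams
  igraph::graph_from_data_frame()     # word1 -> word2 edges, n kept as an edge attribute

graph_sketch
```

The first two columns become the "from" and "to" vertices and any remaining columns become edge attributes, so the count `n` stays available for styling edges when the graph is drawn with a plotting package.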